Word frequencies from large body of scraped text

I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.



Here is a snippet from the original file:




1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środ­kach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.



My goal is to recover the true word labels and remove the noisy ones (e.g. collocations of words concatenated through punctuation). The first part of the script handles the cleaning. As you can see in the data sample above, several noisy entries belong to the same true label; once cleaned, their frequencies should be added together, which is what I try to achieve in the second part of my script.
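
For example, these entries from the sample above should all collapse into the single label środkach, with their frequencies summed:

1 (środkach)
1 „środkach”
1 środkach...

becomes

3 środkach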



Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:



# -*- coding: utf-8 -*-

import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1 ,num_batches +1):

    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'

    with io.open(infile_path, 'r', encoding = 'utf8') as infile, \
            io.open(outfile_path, 'w', encoding='utf8') as outfile:

        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]

        data = pd.DataFrame({"word": [], "freq": []})

        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]

        freq_dict = dict()
        keys = np.unique(data['word'])

        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]

  for key in freq_dict.keys():
        outfile.write("%s,%s\n" % (key, freq_dict[key]))


The problem with this code is that it is either buggy, running into an infinite loop or something similar, or very slow, even for processing a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops? Or by using a different data structure for word-frequency lookup than a dictionary?










python performance dictionary lookup

asked Dec 23 '18 at 0:05, edited Dec 23 '18 at 1:34
– Des Grieux








  • I've added the fixed code in one piece below. Thank you!
    – Des Grieux, Dec 23 '18 at 0:22

3 Answers

for i in range(1 ,num_batches +1):


Your inter-token spacing here is a little wonky. I suggest running this code through a linter to get it to be PEP8-compliant.
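
For reference, the PEP8-spaced form of that line is:

for i in range(1, num_batches + 1):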



This string:



r'input_batch_' + str(i) + r'.txt'


can be:



f'input_batch_{i}.txt'


This code:



entries_raw = infile.readlines()
entries_single = [x.strip() for x in entries_raw]
entries = [x.split('\t') for x in entries_single]


can also be simplified, to:



entries = [line.rstrip().split('\t') for line in infile]


Note a few things. You don't need to call readlines(); you can treat the file object itself as an iterator. Also, avoid calling a variable x even if it's an intermediate variable; you need meaningful names.



This is an antipattern inherited from C:



for j in range(len(entries)):
    data.loc[j] = entries[j][1], entries[j][0]


You should instead do:



for j, entry in enumerate(entries):
    data.loc[j] = entry[1], entry[0]


That also applies to your for x in range(len(data)):.
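
A sketch of that same fix applied to the inner scan, iterating over the values while keeping the index:

for x, word in enumerate(data['word']):
    if word == key:
        ...  # update freq_dict as before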



This:



freq_dict = dict()


should be:



freq_dict = {}


This:



if key in freq_dict:
    prior_freq = freq_dict.get(key)
    freq_dict[key] = prior_freq + data['freq'][x]
else:
    freq_dict[key] = data['freq'][x]


can be simplified to:



prior_freq = freq_dict.get(key)
freq_dict[key] = data['freq'][x]
if prior_freq is not None:
    freq_dict[key] += prior_freq


or even (courtesy @AlexHall):



freq_dict[key] = data['freq'][x] + freq_dict.get(key, 0)


Note a few things. First of all, you were inappropriately using get: either check for key presence and then index with [], or use get and then check the return value (which is preferred, as it requires fewer key lookups).



This loop:



for key in freq_dict.keys():
    outfile.write("%s,%s\n" % (key, freq_dict[key]))


needs adjustment in a few ways. Firstly, it won't run at all because its indentation is wrong. Also, rather than only iterating over keys, you should be iterating over items:



for key, freq in freq_dict.items():
    outfile.write(f'{key},{freq}\n')





edited Dec 24 '18 at 1:02, answered Dec 23 '18 at 1:35
– Reinderien

  • The freq_dict code is wrong because it calls .get after assigning to that key. In any case it can be simplified more to freq_dict[key] = data['freq'][x] + freq_dict.get(key, 0).
    – Alex Hall, Dec 23 '18 at 19:07

  • @AlexHall Good eyes. Edited.
    – Reinderien, Dec 24 '18 at 1:04

Reinderien covered most of the other issues with your code. But you should know there's a built-in class for simplifying the task of tallying word frequencies:



from collections import Counter

yourListOfWords = [...]

frequencyOfEachWord = Counter(yourListOfWords)
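
Since each row here already carries a count, a minimal sketch of the weighted variant (rows is a hypothetical list of (freq, word) pairs parsed from one batch):

from collections import Counter

rows = [(1, 'środkach'), (1, 'środkach'), (1, 'środkami')]  # hypothetical parsed rows

counter = Counter()
for freq, word in rows:
    counter[word] += freq  # add the row's count, not just 1

print(counter)  # Counter({'środkach': 2, 'środkami': 1})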





answered Dec 23 '18 at 1:58
– AleksandrH

To expand on the answer by @AleksandrH, this is how I would write it using collections.Counter:

import io
from collections import Counter
import regex as re  # the normal re module does not support \p{P}...

def read_file(file_name):
    """Reads a file into a Counter object.

    File contains rows with counts and words.
    Words can be multiple words separated by punctuation or whitespace.
    If that is the case, separate them.
    """
    counter = Counter()
    with io.open(file_name, 'r', encoding = 'utf8') as infile:
        for line in infile:
            if not line:
                continue
            # need to omit '\t' when testing, because SO replaces tabs with whitespace
            freq, words = line.strip().split('\t')
            # split on punctuation and whitespace
            words = re.split(r'\p{P}|\s', words)
            # update all words
            for word in filter(None, words):  # filter out empty strings
                counter[word] += int(freq)
    return counter

def write_file(file_name, counter):
    with io.open(file_name, 'w', encoding='utf8') as outfile:
        # use `items` if order does not matter
        outfile.writelines(f'{word},{freq}\n' for word, freq in counter.most_common())


if __name__ == "__main__":
    num_batches = 54
    for i in range(1, num_batches + 1):
        counter = read_file(f"input_batch_{i}.txt")
        write_file(f"output_batch_{i}.txt", counter)


This also has (the start of) a docstring describing what the read_file function does, functions in the first place to separate concerns, and an if __name__ == "__main__": guard that allows importing from this script without the main code running.
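
As a quick sanity check of the splitting step (using the third-party regex module the code imports), the noisy labels from the sample come apart as intended, leaving empty strings for filter(None, ...) to drop:

>>> import regex as re
>>> re.split(r'\p{P}|\s', '„środkach”')
['', 'środkach', '']
>>> re.split(r'\p{P}|\s', 'środkach.życie')
['środkach', 'życie']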
answered Dec 23 '18 at 11:12
– Graipher