Factorio analysis: data munging

$begingroup$


This project is... a little ridiculous. It's working, but it's a complete mess.



Data about Factorio's game economy are pulled from the wiki via the MediaWiki API, scrubbed, preprocessed, and thrown into SciPy for linear programming analysis using the MOSEK interior-point method.



The pull script only depends on requests:



#!/usr/bin/env python3

import json, lzma, re
from os.path import getsize
from requests import Session
from sys import stdout

session = Session()


def get_mediawiki(content=False, progress=None, **kwargs):
    """
    https://stable.wiki.factorio.com is an instance of MediaWiki.
    The API endpoint is
    https://stable.wiki.factorio.com/api.php
    """
    params = {'action': 'query',
              'format': 'json',
              **kwargs}
    if content:
        params.update({'prop': 'revisions',
                       'rvprop': 'content'})
    so_far = 0
    while True:
        resp = session.get('https://stable.wiki.factorio.com/api.php',
                           params=params)
        resp.raise_for_status()

        doc = resp.json()
        pages = doc['query']['pages'].values()
        if content:
            full_pages = tuple(p for p in pages if 'revisions' in p)
            if progress:
                so_far += len(full_pages)
                progress(so_far, len(pages))
            yield from full_pages
        else:
            yield from pages

        if 'batchcomplete' in doc:
            break
        params.update(doc['continue'])


def get_category(name, content=False, progress=None, **kwargs):
    return get_mediawiki(content=content, progress=progress,
                         generator='categorymembers',
                         gcmtitle=f'Category:{name}',
                         gcmtype='page',
                         gcmlimit=500,
                         **kwargs)


def get_archived_titles():
    return get_category('Archived')


def get_infoboxes(progress):
    return get_category('Infobox_page', content=True, progress=progress)


def get_inter_tables(titles, progress):
    return get_mediawiki(content=True, progress=progress,
                         titles='|'.join(titles))


line_re = re.compile(r'\n\s*\|')
var_re = re.compile(
    r'^\s*'
    r'(\S+)'
    r'\s*=\s*'
    r'(.+?)'
    r'\s*$')


def parse_infobox(page):
    """
    Example:

    map-color = 006090
    <noinclude>
    [[Category:Infobox page]]
    </noinclude>

    Splitting on newline isn't a great idea, because
    https://www.mediawiki.org/wiki/Help:Templates#Named_parameters
    shows that only the pipe is mandatory as a separator. However, only
    splitting on pipe is worse, because there are pipes on the inside of links.
    """

    content = page['revisions'][0]['*']
    entries = (
        var_re.match(e)
        for e in line_re.split(
            content.split('{{', maxsplit=1)[1]
            .rsplit('}}', maxsplit=1)[0]
        )
    )
    title = page['title'].split(':', maxsplit=1)[1]
    d = {'pageid': page['pageid'],
         'title': title}
    d.update(dict(e.groups() for e in entries if e))
    return d


part_tok = r'\s*([^|]*?)'
border_tok = r'\s*\|'
row_image_re = re.compile(
    r'\{\{\s*'
    r'(?P<type>\w+)'
    f'{border_tok}'
    f'{part_tok}'
    r'(?:'
        f'{border_tok}'
        f'{part_tok}'
    r')?'
    r'(?:'
        f'{border_tok}'
        r'[^{}]*'
    r')?'
    r'\}\}\s*'
    r'(?P<sep>'
        r'(?:'
            r'\|\||\+|→'
        r')?'
    r')',
)


def iter_cells(row):
    """
    e.g.
    |
    || 10 + 3
    || Solid fuel
    or
    | Oil refinery
    || Basic oil processing
    || Icon + Time
    → Icon + (Light oil 40)
    """

    cell = []
    for m in row_image_re.finditer(row):
        if m.group('sep') == '||':
            cell.append(m.groups()[:-1])
            yield cell
            cell = []
        else:
            cell.append(m.groups())
    if cell:
        yield cell


def parse_inter_table(page):
    """
    Example:



    or

    -

    """
    title = page['title']
    content = page['revisions'][0]['*']
if '|]+)}')

    def wood_mining(self) -> Iterable[MiningRecipe]:
        miners = tuple(
            ManualMiner(tool)
            for tool in all_items.values()
            if tool.prototype_type == 'mining-tool'
        )
        for m in self.tree_re.finditer(self.resource.mining_time):
            mining_time, source = int(m[1]), m[2]
            for miner in miners:
                yield self.produce(
                    MiningRecipe, miner,
                    mining_hardness=float(self.resource.mining_hardness),
                    mining_time=mining_time,
                    title=f'{self.resource} ({miner} from {source})')

    def make(self) -> Iterable[Recipe]:
        if self.rates:
            if self.resource.prototype_type == 'technology':
                yield self.produce(
                    TechRecipe, self.producers[0],
                    cost_multiplier=float(self.resource.cost_multiplier))
            elif self.resource.title == 'Energy':
                yield self.produce(Recipe, self.producers[0])
            else:
                yield from self.for_energy(Recipe)
        elif self.resource.title == 'Raw wood':
            yield from self.wood_mining()
        elif self.resource.mining_time:
            yield from self.for_energy(
                MiningRecipe,
                mining_hardness=float(self.resource.mining_hardness),
                mining_time=float(self.resource.mining_time))
        elif self.resource.title == 'Crude oil':
            yield from self.for_energy(FluidRecipe)
        elif self.resource.title == 'Water':
            yield self.produce(FluidRecipe, self.producers[0])
        else:
            raise NotImplementedError()


def parse_power(s: str) -> float:
    m = power_re.search(s)
    return float(m[1]) * si_facs[m[2]]


def items_of_type(t: str) -> Iterable[Item]:
    return (i for i in all_items.values()
            if i.prototype_type == t)


barrel_re = re.compile(r'empty .+ barrel')


def parse_producers(s: str) -> Iterable[Item]:
    for p in s.split('+'):
        p = p.strip().lower()
        if p == 'furnace':
            yield from items_of_type('furnace')
        elif p == 'assembling machine':
            yield from (all_items[f'assembling machine {i}']
                        for i in range(1, 4))
        elif p == 'mining drill':
            yield from (all_items[f'{t} mining drill']
                        for t in ('burner', 'electric'))
        elif p == 'manual' or barrel_re.match(p):
            continue
        else:
            yield all_items[p]


def trim(items: dict):
    to_delete = tuple(k for k, v in items.items() if not v.keep)
    print(f'Dropping {len(to_delete)} items...')
    for k in to_delete:
        del items[k]


def energy_data() -> dict:
    solar_ave = parse_power(next(
        s for s in all_items['solar panel'].power_output.split('<br/>')
        if 'average' in s))

    eng = all_items['steam engine']
    eng_rate = float(eng.fluid_consumption
                     .split('/')[0])
    eng_power = parse_power(eng.power_output)

    turbine = all_items['steam turbine']
    turbine_rate = float(turbine.fluid_consumption
                         .split('/')[0])
    turbine_power_500 = 5.82e6  # ignore non-precise data and use this instead
    turbine_power_165 = 1.8e6  # from wiki page body

    return {
        'title': 'Energy',
        'recipes': (
            {
                'building': 'Solar panel',
                'process': 'Energy (Solar panel)',
                'inputs': {
                    'Time': 1
                },
                'outputs': {
                    'Energy': solar_ave
                }
            },
            {
                'building': 'Steam engine',
                'process': 'Energy (Steam engine)',
                'inputs': {
                    'Time': 1,
                    'Steam165': eng_rate
                },
                'outputs': {
                    'Energy': eng_power
                }
            },
            {
                'building': 'Steam turbine',
                'process': 'Energy (Steam turbine @ 165C)',
                'inputs': {
                    'Time': 1,
                    'Steam165': turbine_rate
                },
                'outputs': {
                    'Energy': turbine_power_165
                }
            },
            {
                'building': 'Steam turbine',
                'process': 'Energy (Steam turbine @ 500C)',
                'inputs': {
                    'Time': 1,
                    'Steam500': turbine_rate
                },
                'outputs': {
                    'Energy': turbine_power_500
                }
            }
        )
    }



def load(fn: str):
    with lzma.open(fn) as f:
        global all_items
        all_items = {k.lower(): Item(d) for k, d in json.load(f).items()}
    all_items['energy'] = Item(energy_data())


def get_recipes() -> (Dict[str, Recipe], Set[str]):
    recipes = {}
    resources = set()
    for item in all_items.values():
        item_recipes = tuple(item.get_recipes())
        recipes.update({i.title: i for i in item_recipes})
        for recipe in item_recipes:
            resources.update(recipe.rates.keys())

    return recipes, resources


def field_size(names: Iterable) -> int:
    return max(len(str(o)) for o in names)


def write_csv_for_r(recipes: Sequence[Recipe], resources: Sequence[str],
                    fn: str):
    # Recipes going down, resources going right

    rec_width = field_size(recipes)
    float_width = 15
    col_format = f'{{:{float_width+8}}}'
    rec_format = '\n{:' + str(rec_width+1) + '}'

    with lzma.open(fn, 'wt') as f:
        f.write(' '*(rec_width+1))
        for res in resources:
            f.write(col_format.format(f'{res},'))

        for rec in recipes:
            f.write(rec_format.format(f'{rec},'))
            for res in resources:
                x = rec.rates.get(res, 0)
                col_format = f'{{:+{len(res)}.{float_width}e}},'
                f.write(col_format.format(x))


def write_for_numpy(recipes: Sequence[Recipe], resources: Sequence[str],
                    meta_fn: str, npz_fn: str):
    rec_names = [r.title for r in recipes]
    w_rec = max(len(r) for r in rec_names)
    recipe_names = np.array(rec_names, copy=False, dtype=f'U{w_rec}')

    w_res = max(len(r) for r in resources)
    resource_names = np.array(resources, copy=False, dtype=f'U{w_res}')

    np.savez_compressed(meta_fn, recipe_names=recipe_names, resource_names=resource_names)

    rec_mat = lil_matrix((len(resources), len(recipes)))
    for j, rec in enumerate(recipes):
        for res, q in rec.rates.items():
            i = resources.index(res)
            rec_mat[i, j] = q
    save_npz(npz_fn, rec_mat.tocsr())


def file_banner(fn):
    print(f'{fn} {getsize(fn)//1024} kiB')


def main():
    fn = 'items.json.xz'
    print(f'Loading {fn}... ', end='')
    load(fn)
    print(f'{len(all_items)} items')

    trim(all_items)

    print('Calculating recipes... ', end='')
    recipes, resources = get_recipes()
    print(f'{len(recipes)} recipes, {len(resources)} resources')

    resources = sorted(resources)
    recipes = sorted(recipes.values(), key=lambda i: i.title)

    print('Saving files for numpy...')
    meta_fn, npz_fn = 'recipe-names.npz', 'recipes.npz'
    write_for_numpy(recipes, resources, meta_fn, npz_fn)
    file_banner(meta_fn)
    file_banner(npz_fn)

    fn = 'recipes.csv.xz'
    print('Saving recipes for use by R...')
    stdout.flush()
    write_csv_for_r(recipes, resources, fn)
    file_banner(fn)


if __name__ == '__main__':
    main()


That's followed by an analysis script that I won't post here, to constrain the scope of this first review.
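For anyone who wants to poke at the paging logic of get_mediawiki without hitting the wiki, here is a minimal offline sketch. The names CANNED, fake_api and paged are made up for illustration, and the page data is invented; only the handling of the 'continue' and 'batchcomplete' keys mirrors the loop in the pull script.

```python
from typing import Callable, Dict, Iterator, List

# Canned responses standing in for the wiki. The shape (query/pages,
# continue, batchcomplete) follows what get_mediawiki reads; the page
# data itself is invented for this sketch.
CANNED: List[dict] = [
    {
        'continue': {'rvcontinue': '2|0', 'continue': '||'},
        'query': {'pages': {
            '1': {'pageid': 1, 'title': 'Infobox:Iron plate',
                  'revisions': [{'*': 'wikitext'}]},
            '2': {'pageid': 2, 'title': 'Infobox:Copper plate'},
        }},
    },
    {
        'batchcomplete': '',
        'query': {'pages': {
            '1': {'pageid': 1, 'title': 'Infobox:Iron plate'},
            '2': {'pageid': 2, 'title': 'Infobox:Copper plate',
                  'revisions': [{'*': 'wikitext'}]},
        }},
    },
]


def fake_api(params: Dict[str, str]) -> dict:
    """Hypothetical stand-in for session.get(...).json()."""
    return CANNED[1] if 'rvcontinue' in params else CANNED[0]


def paged(fetch: Callable[[Dict[str, str]], dict],
          **kwargs: str) -> Iterator[dict]:
    params = {'action': 'query', 'format': 'json', **kwargs}
    while True:
        doc = fetch(params)
        # Only pages whose content arrived in this response are yielded.
        yield from (p for p in doc['query']['pages'].values()
                    if 'revisions' in p)
        if 'batchcomplete' in doc:
            break
        # Merge the continuation tokens into the next request.
        params.update(doc['continue'])


titles = [p['title'] for p in paged(fake_api, prop='revisions')]
print(titles)
```

Each page shows up exactly once even though both responses list both pages, because a page is only yielded from the response that carries its 'revisions' entry.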



The main thing that needs work is the recipe factory code. It sprinkles logic about item types where it doesn't belong, and that really needs to be improved. I have some ideas about how to do that, but I'd like to hear from the community (on that, and any other wrinkles you find).
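One possible direction, as a rough sketch rather than a finished design: move the branch conditions out of make() into an ordered table of rules, so the knowledge about item types lives in one place. Resource, RULES and recipe_kind are hypothetical names, Resource is only a stand-in for the real item class, and the strings stand in for the handlers that would actually build the recipes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Resource:
    """Hypothetical stand-in for a parsed item; not the real Item class."""
    title: str
    prototype_type: str = ''
    mining_time: Optional[str] = None
    has_rates: bool = False


Rule = Tuple[Callable[[Resource], bool], str]

# Ordered rules: the first predicate that matches wins, mirroring the
# order of the branches in make(). Each string names the kind of recipe
# that a real handler would build.
RULES: List[Rule] = [
    (lambda r: r.has_rates and r.prototype_type == 'technology', 'tech'),
    (lambda r: r.has_rates and r.title == 'Energy', 'plain'),
    (lambda r: r.has_rates, 'per-energy-source'),
    (lambda r: r.title == 'Raw wood', 'wood-mining'),
    (lambda r: bool(r.mining_time), 'mining'),
    (lambda r: r.title == 'Crude oil', 'fluid-per-energy-source'),
    (lambda r: r.title == 'Water', 'fluid'),
]


def recipe_kind(resource: Resource) -> str:
    """Return the kind of the first rule that matches the resource."""
    for matches, kind in RULES:
        if matches(resource):
            return kind
    raise NotImplementedError(resource.title)


ore_kind = recipe_kind(Resource('Iron ore', mining_time='2'))
wood_kind = recipe_kind(Resource('Raw wood'))
print(ore_kind, wood_kind)
```

In a real version the second element of each rule would be a callable that yields recipes, so adding a new item type would mean adding a row instead of another elif.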


















$endgroup$


















    0












    $begingroup$


    This project is... a little ridiculous. It's working, but it's a complete mess.



    Data about Factorio's game economy are pulled from the wiki via the MediaWiki API, scrubbed, preprocessed, and thrown into Scipy for linear programming analysis using the MOSEK interior point method.



    The pull script only depends on requests:



    #!/usr/bin/env python3

    import json, lzma, re
    from os.path import getsize
    from requests import Session
    from sys import stdout

    session = Session()


    def get_mediawiki(content=False, progress=None, **kwargs):
    """
    https://stable.wiki.factorio.com is an instance of MediaWiki.
    The API endpoint is
    https://stable.wiki.factorio.com/api.php
    """
    params = 'action': 'query',
    'format': 'json',
    **kwargs
    if content:
    params.update('prop': 'revisions',
    'rvprop': 'content')
    so_far = 0
    while True:
    resp = session.get('https://stable.wiki.factorio.com/api.php',
    params=params)
    resp.raise_for_status()

    doc = resp.json()
    pages = doc['query']['pages'].values()
    if content:
    full_pages = tuple(p for p in pages if 'revisions' in p)
    if progress:
    so_far += len(full_pages)
    progress(so_far, len(pages))
    yield from full_pages
    else:
    yield from pages

    if 'batchcomplete' in doc:
    break
    params.update(doc['continue'])


    def get_category(name, content=False, progress=None, **kwargs):
    return get_mediawiki(content=content, progress=progress,
    generator='categorymembers',
    gcmtitle=f'Category:name',
    gcmtype='page',
    gcmlimit=500,
    **kwargs)


    def get_archived_titles():
    return get_category('Archived')


    def get_infoboxes(progress):
    return get_category('Infobox_page', content=True, progress=progress)


    def get_inter_tables(titles, progress):
    return get_mediawiki(content=True, progress=progress,
    titles='|'.join(titles))


    line_re = re.compile(r'ns*|')
    var_re = re.compile(
    r'^s*'
    r'(S+)'
    r's*=s*'
    r'(.+?)'
    r's*$')


    def parse_infobox(page):
    """
    Example:

    map-color = 006090
    <noinclude>
    [[Category:Infobox page]]
    </noinclude>

    Splitting on newline isn't a great idea, because
    https://www.mediawiki.org/wiki/Help:Templates#Named_parameters
    shows that only the pipe is mandatory as a separator. However, only
    splitting on pipe is worse, because there are pipes on the inside of links.
    """

    content = page['revisions'][0]['*']
    entries = (
    var_re.match(e)
    for e in line_re.split(
    content.split('', maxsplit=1)[1]
    .rsplit('', maxsplit=1)[0]
    )
    )
    title = page['title'].split(':', maxsplit=1)[1]
    d = 'pageid': page['pageid'],
    'title': title
    d.update(dict(e.groups() for e in entries if e))
    return d


    part_tok = r's*([^|]*?)'
    border_tok = r's*|'
    row_image_re = re.compile(
    r's*'
    r'(?P<type>w+)'
    f'border_tok'
    f'part_tok'
    r'(?:'
    f'border_tok'
    f'part_tok'
    r')?'
    r'(?:'
    f'border_tok'
    r'[^]*'
    r')?'
    r's*'
    r'(?P<sep>'
    r'(?:'
    r'|||+|→'
    r')?'
    r')',
    )


    def iter_cells(row):
    """
    e.g.
    |
    || 10 + 3
    || Solid fuel
    or
    | Oil refinery
    || Basic oil processing
    || Icon + Time
    → Icon + (Light oil 40)
    """

    cell = []
    for m in row_image_re.finditer(row):
    if m.group('sep') == '||':
    cell.append(m.groups()[:-1])
    yield cell
    cell = []
    else:
    cell.append(m.groups())
    if cell:
    yield cell


    def parse_inter_table(page):
    """
    Example:



    or

    -

    """
    title = page['title']
    content = page['revisions'][0]['*']
    if '|]+)}')

    def wood_mining(self) -> Iterable[MiningRecipe]:
    miners = tuple(
    ManualMiner(tool)
    for tool in all_items.values()
    if tool.prototype_type == 'mining-tool'
    )
    for m in self.tree_re.finditer(self.resource.mining_time):
    mining_time, source = int(m[1]), m[2]
    for miner in miners:
    yield self.produce(
    MiningRecipe, miner,
    mining_hardness=float(self.resource.mining_hardness),
    mining_time=mining_time,
    title=f'self.resource (miner from source)')

    def make(self) -> Iterable[Recipe]:
    if self.rates:
    if self.resource.prototype_type == 'technology':
    yield self.produce(
    TechRecipe, self.producers[0],
    cost_multiplier=float(self.resource.cost_multiplier))
    elif self.resource.title == 'Energy':
    yield self.produce(Recipe, self.producers[0])
    else:
    yield from self.for_energy(Recipe)
    elif self.resource.title == 'Raw wood':
    yield from self.wood_mining()
    elif self.resource.mining_time:
    yield from self.for_energy(
    MiningRecipe,
    mining_hardness=float(self.resource.mining_hardness),
    mining_time=float(self.resource.mining_time))
    elif self.resource.title == 'Crude oil':
    yield from self.for_energy(FluidRecipe)
    elif self.resource.title == 'Water':
    yield self.produce(FluidRecipe, self.producers[0])
    else:
    raise NotImplementedError()


    def parse_power(s: str) -> float:
    m = power_re.search(s)
    return float(m[1]) * si_facs[m[2]]


    def items_of_type(t: str) -> Iterable[Item]:
    return (i for i in all_items.values()
    if i.prototype_type == t)


    barrel_re = re.compile(r'empty .+ barrel')


    def parse_producers(s: str) -> Iterable[Item]:
    for p in s.split('+'):
    p = p.strip().lower()
    if p == 'furnace':
    yield from items_of_type('furnace')
    elif p == 'assembling machine':
    yield from (all_items[f'assembling machine i']
    for i in range(1, 4))
    elif p == 'mining drill':
    yield from (all_items[f't mining drill']
    for t in ('burner', 'electric'))
    elif p == 'manual' or barrel_re.match(p):
    continue
    else:
    yield all_items[p]


    def trim(items: dict):
    to_delete = tuple(k for k, v in items.items() if not v.keep)
    print(f'Dropping len(to_delete) items...')
    for k in to_delete:
    del items[k]


    def energy_data() -> dict:
    solar_ave = parse_power(next(
    s for s in all_items['solar panel'].power_output.split('<br/>')
    if 'average' in s))

    eng = all_items['steam engine']
    eng_rate = float(eng.fluid_consumption
    .split('/')[0])
    eng_power = parse_power(eng.power_output)

    turbine = all_items['steam turbine']
    turbine_rate = float(turbine.fluid_consumption
    .split('/')[0])
    turbine_power_500 = 5.82e6 # ignore non-precise data and use this instead
    turbine_power_165 = 1.8e6 # from wiki page body

    return
    'title': 'Energy',
    'recipes': (

    'building': 'Solar panel',
    'process': 'Energy (Solar panel)',
    'inputs':
    'Time': 1
    ,
    'outputs':
    'Energy': solar_ave

    ,

    'building': 'Steam engine',
    'process': 'Energy (Steam engine)',
    'inputs':
    'Time': 1,
    'Steam165': eng_rate
    ,
    'outputs':
    'Energy': eng_power

    ,

    'building': 'Steam turbine',
    'process': 'Energy (Steam turbine @ 165C)',
    'inputs':
    'Time': 1,
    'Steam165': turbine_rate
    ,
    'outputs':
    'Energy': turbine_power_165

    ,

    'building': 'Steam turbine',
    'process': 'Energy (Steam turbine @ 500C)',
    'inputs':
    'Time': 1,
    'Steam500': turbine_rate
    ,
    'outputs':
    'Energy': turbine_power_500


    )



    def load(fn: str):
    with lzma.open(fn) as f:
    global all_items
    all_items = k.lower(): Item(d) for k, d in json.load(f).items()
    all_items['energy'] = Item(energy_data())


    def get_recipes() -> (Dict[str, Recipe], Set[str]):
    recipes =
    resources = set()
    for item in all_items.values():
    item_recipes = tuple(item.get_recipes())
    recipes.update(i.title: i for i in item_recipes)
    for recipe in item_recipes:
    resources.update(recipe.rates.keys())

    return recipes, resources


    def field_size(names: Iterable) -> int:
    return max(len(str(o)) for o in names)


    def write_csv_for_r(recipes: Sequence[Recipe], resources: Sequence[str],
    fn: str):
    # Recipes going down, resources going right

    rec_width = field_size(recipes)
    float_width = 15
    col_format = f':float_width+8'
    rec_format = 'n:' + str(rec_width+1) + ''

    with lzma.open(fn, 'wt') as f:
    f.write(' '*(rec_width+1))
    for res in resources:
    f.write(col_format.format(f'res,'))

    for rec in recipes:
    f.write(rec_format.format(f'rec,'))
    for res in resources:
    x = rec.rates.get(res, 0)
    col_format = f':+len(res).float_widthe,'
    f.write(col_format.format(x))


    def write_for_numpy(recipes: Sequence[Recipe], resources: Sequence[str],
    meta_fn: str, npz_fn: str):
    rec_names = [r.title for r in recipes]
    w_rec = max(len(r) for r in rec_names)
    recipe_names = np.array(rec_names, copy=False, dtype=f'Uw_rec')

    w_res = max(len(r) for r in resources)
    resource_names = np.array(resources, copy=False, dtype=f'Uw_res')

    np.savez_compressed(meta_fn, recipe_names=recipe_names, resource_names=resource_names)

    rec_mat = lil_matrix((len(resources), len(recipes)))
    for j, rec in enumerate(recipes):
    for res, q in rec.rates.items():
    i = resources.index(res)
    rec_mat[i, j] = q
    save_npz(npz_fn, rec_mat.tocsr())


    def file_banner(fn):
    print(f'fn getsize(fn)//1024 kiB')


    def main():
    fn = 'items.json.xz'
    print(f'Loading fn... ', end='')
    load(fn)
    print(f'len(all_items) items')

    trim(all_items)

    print('Calculating recipes... ', end='')
    recipes, resources = get_recipes()
    print(f'len(recipes) recipes, len(resources) resources')

    resources = sorted(resources)
    recipes = sorted(recipes.values(), key=lambda i: i.title)

    print('Saving files for numpy...')
    meta_fn, npz_fn = 'recipe-names.npz', 'recipes.npz'
    write_for_numpy(recipes, resources, meta_fn, npz_fn)
    file_banner(meta_fn)
    file_banner(npz_fn)

    fn = 'recipes.csv.xz'
    print(f'Saving recipes for use by R...')
    stdout.flush()
    write_csv_for_r(recipes, resources, fn)
    file_banner(fn)


    if __name__ == '__main__':
    main()


    That's followed by an analysis script that I won't post here, to constrain the scope of this first review.



    The main thing that needs work is the recipe factory code. It sprinkles logic about item types where it doesn't belong, and that really needs to be improved. I have some ideas about how to do that, but I'd like to hear from the community (on that, and any other wrinkles you find).









    share









    $endgroup$














      0












      0








      0





      $begingroup$


      This project is... a little ridiculous. It's working, but it's a complete mess.



      Data about Factorio's game economy are pulled from the wiki via the MediaWiki API, scrubbed, preprocessed, and thrown into Scipy for linear programming analysis using the MOSEK interior point method.



      The pull script only depends on requests:



      #!/usr/bin/env python3

      import json, lzma, re
      from os.path import getsize
      from requests import Session
      from sys import stdout

      session = Session()


      def get_mediawiki(content=False, progress=None, **kwargs):
      """
      https://stable.wiki.factorio.com is an instance of MediaWiki.
      The API endpoint is
      https://stable.wiki.factorio.com/api.php
      """
      params = 'action': 'query',
      'format': 'json',
      **kwargs
      if content:
      params.update('prop': 'revisions',
      'rvprop': 'content')
      so_far = 0
      while True:
      resp = session.get('https://stable.wiki.factorio.com/api.php',
      params=params)
      resp.raise_for_status()

      doc = resp.json()
      pages = doc['query']['pages'].values()
      if content:
      full_pages = tuple(p for p in pages if 'revisions' in p)
      if progress:
      so_far += len(full_pages)
      progress(so_far, len(pages))
      yield from full_pages
      else:
      yield from pages

      if 'batchcomplete' in doc:
      break
      params.update(doc['continue'])


      def get_category(name, content=False, progress=None, **kwargs):
      return get_mediawiki(content=content, progress=progress,
      generator='categorymembers',
      gcmtitle=f'Category:name',
      gcmtype='page',
      gcmlimit=500,
      **kwargs)


      def get_archived_titles():
      return get_category('Archived')


      def get_infoboxes(progress):
      return get_category('Infobox_page', content=True, progress=progress)


      def get_inter_tables(titles, progress):
      return get_mediawiki(content=True, progress=progress,
      titles='|'.join(titles))


      line_re = re.compile(r'ns*|')
      var_re = re.compile(
      r'^s*'
      r'(S+)'
      r's*=s*'
      r'(.+?)'
      r's*$')


      def parse_infobox(page):
      """
      Example:

      map-color = 006090
      <noinclude>
      [[Category:Infobox page]]
      </noinclude>

      Splitting on newline isn't a great idea, because
      https://www.mediawiki.org/wiki/Help:Templates#Named_parameters
      shows that only the pipe is mandatory as a separator. However, only
      splitting on pipe is worse, because there are pipes on the inside of links.
      """

      content = page['revisions'][0]['*']
      entries = (
      var_re.match(e)
      for e in line_re.split(
      content.split('', maxsplit=1)[1]
      .rsplit('', maxsplit=1)[0]
      )
      )
      title = page['title'].split(':', maxsplit=1)[1]
      d = 'pageid': page['pageid'],
      'title': title
      d.update(dict(e.groups() for e in entries if e))
      return d


      part_tok = r's*([^|]*?)'
      border_tok = r's*|'
      row_image_re = re.compile(
      r's*'
      r'(?P<type>w+)'
      f'border_tok'
      f'part_tok'
      r'(?:'
      f'border_tok'
      f'part_tok'
      r')?'
      r'(?:'
      f'border_tok'
      r'[^]*'
      r')?'
      r's*'
      r'(?P<sep>'
      r'(?:'
      r'|||+|→'
      r')?'
      r')',
      )


      def iter_cells(row):
      """
      e.g.
      |
      || 10 + 3
      || Solid fuel
      or
      | Oil refinery
      || Basic oil processing
      || Icon + Time
      → Icon + (Light oil 40)
      """

      cell = []
      for m in row_image_re.finditer(row):
      if m.group('sep') == '||':
      cell.append(m.groups()[:-1])
      yield cell
      cell = []
      else:
      cell.append(m.groups())
      if cell:
      yield cell


      def parse_inter_table(page):
      """
      Example:



      or

      -

      """
      title = page['title']
      content = page['revisions'][0]['*']
      if '|]+)}')

      def wood_mining(self) -> Iterable[MiningRecipe]:
      miners = tuple(
      ManualMiner(tool)
      for tool in all_items.values()
      if tool.prototype_type == 'mining-tool'
      )
      for m in self.tree_re.finditer(self.resource.mining_time):
      mining_time, source = int(m[1]), m[2]
      for miner in miners:
      yield self.produce(
      MiningRecipe, miner,
      mining_hardness=float(self.resource.mining_hardness),
      mining_time=mining_time,
      title=f'self.resource (miner from source)')

      def make(self) -> Iterable[Recipe]:
      if self.rates:
      if self.resource.prototype_type == 'technology':
      yield self.produce(
      TechRecipe, self.producers[0],
      cost_multiplier=float(self.resource.cost_multiplier))
      elif self.resource.title == 'Energy':
      yield self.produce(Recipe, self.producers[0])
      else:
      yield from self.for_energy(Recipe)
      elif self.resource.title == 'Raw wood':
      yield from self.wood_mining()
      elif self.resource.mining_time:
      yield from self.for_energy(
      MiningRecipe,
      mining_hardness=float(self.resource.mining_hardness),
      mining_time=float(self.resource.mining_time))
      elif self.resource.title == 'Crude oil':
      yield from self.for_energy(FluidRecipe)
      elif self.resource.title == 'Water':
      yield self.produce(FluidRecipe, self.producers[0])
      else:
      raise NotImplementedError()


      def parse_power(s: str) -> float:
      m = power_re.search(s)
      return float(m[1]) * si_facs[m[2]]


      def items_of_type(t: str) -> Iterable[Item]:
      return (i for i in all_items.values()
      if i.prototype_type == t)


      barrel_re = re.compile(r'empty .+ barrel')


      def parse_producers(s: str) -> Iterable[Item]:
      for p in s.split('+'):
      p = p.strip().lower()
      if p == 'furnace':
      yield from items_of_type('furnace')
      elif p == 'assembling machine':
      yield from (all_items[f'assembling machine i']
      for i in range(1, 4))
      elif p == 'mining drill':
      yield from (all_items[f't mining drill']
      for t in ('burner', 'electric'))
      elif p == 'manual' or barrel_re.match(p):
      continue
      else:
      yield all_items[p]


      def trim(items: dict):
      to_delete = tuple(k for k, v in items.items() if not v.keep)
      print(f'Dropping len(to_delete) items...')
      for k in to_delete:
      del items[k]


      def energy_data() -> dict:
      solar_ave = parse_power(next(
      s for s in all_items['solar panel'].power_output.split('<br/>')
      if 'average' in s))

      eng = all_items['steam engine']
      eng_rate = float(eng.fluid_consumption
      .split('/')[0])
      eng_power = parse_power(eng.power_output)

      turbine = all_items['steam turbine']
      turbine_rate = float(turbine.fluid_consumption
      .split('/')[0])
      turbine_power_500 = 5.82e6 # ignore non-precise data and use this instead
      turbine_power_165 = 1.8e6 # from wiki page body

      return
      'title': 'Energy',
      'recipes': (

      'building': 'Solar panel',
      'process': 'Energy (Solar panel)',
      'inputs':
      'Time': 1
      ,
      'outputs':
      'Energy': solar_ave

      ,

      'building': 'Steam engine',
      'process': 'Energy (Steam engine)',
      'inputs':
      'Time': 1,
      'Steam165': eng_rate
      ,
      'outputs':
      'Energy': eng_power

      ,

      'building': 'Steam turbine',
      'process': 'Energy (Steam turbine @ 165C)',
      'inputs':
      'Time': 1,
      'Steam165': turbine_rate
      ,
      'outputs':
      'Energy': turbine_power_165

      ,

      'building': 'Steam turbine',
      'process': 'Energy (Steam turbine @ 500C)',
      'inputs':
      'Time': 1,
      'Steam500': turbine_rate
      ,
      'outputs':
      'Energy': turbine_power_500


      )



      def load(fn: str):
      with lzma.open(fn) as f:
      global all_items
      all_items = k.lower(): Item(d) for k, d in json.load(f).items()
      all_items['energy'] = Item(energy_data())


      def get_recipes() -> (Dict[str, Recipe], Set[str]):
      recipes =
      resources = set()
      for item in all_items.values():
      item_recipes = tuple(item.get_recipes())
      recipes.update(i.title: i for i in item_recipes)
      for recipe in item_recipes:
      resources.update(recipe.rates.keys())

      return recipes, resources


      def field_size(names: Iterable) -> int:
      return max(len(str(o)) for o in names)


      def write_csv_for_r(recipes: Sequence[Recipe], resources: Sequence[str],
      fn: str):
      # Recipes going down, resources going right

      rec_width = field_size(recipes)
      float_width = 15
      col_format = f':float_width+8'
      rec_format = 'n:' + str(rec_width+1) + ''

      with lzma.open(fn, 'wt') as f:
      f.write(' '*(rec_width+1))
      for res in resources:
      f.write(col_format.format(f'res,'))

      for rec in recipes:
      f.write(rec_format.format(f'rec,'))
      for res in resources:
      x = rec.rates.get(res, 0)
      col_format = f':+len(res).float_widthe,'
      f.write(col_format.format(x))


      def write_for_numpy(recipes: Sequence[Recipe], resources: Sequence[str],
      meta_fn: str, npz_fn: str):
      rec_names = [r.title for r in recipes]
      w_rec = max(len(r) for r in rec_names)
      recipe_names = np.array(rec_names, copy=False, dtype=f'Uw_rec')

      w_res = max(len(r) for r in resources)
      resource_names = np.array(resources, copy=False, dtype=f'Uw_res')

      np.savez_compressed(meta_fn, recipe_names=recipe_names, resource_names=resource_names)

      rec_mat = lil_matrix((len(resources), len(recipes)))
      for j, rec in enumerate(recipes):
      for res, q in rec.rates.items():
      i = resources.index(res)
      rec_mat[i, j] = q
      save_npz(npz_fn, rec_mat.tocsr())


      def file_banner(fn):
      print(f'fn getsize(fn)//1024 kiB')


      def main():
          fn = 'items.json.xz'
          print(f'Loading {fn}... ', end='')
          load(fn)
          print(f'{len(all_items)} items')

          trim(all_items)

          print('Calculating recipes... ', end='')
          recipes, resources = get_recipes()
          print(f'{len(recipes)} recipes, {len(resources)} resources')

          resources = sorted(resources)
          recipes = sorted(recipes.values(), key=lambda i: i.title)

          print('Saving files for numpy...')
          meta_fn, npz_fn = 'recipe-names.npz', 'recipes.npz'
          write_for_numpy(recipes, resources, meta_fn, npz_fn)
          file_banner(meta_fn)
          file_banner(npz_fn)

          fn = 'recipes.csv.xz'
          print(f'Saving recipes for use by R...')
          stdout.flush()
          write_csv_for_r(recipes, resources, fn)
          file_banner(fn)


      if __name__ == '__main__':
          main()


      That's followed by an analysis script that I won't post here, to constrain the scope of this first review.
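
For orientation, the layout that write_csv_for_r produces (recipes going down, resources going right) can be sketched without the rest of the pipeline. The following is a minimal, self-contained illustration that reuses the same two-stage format strings. The write_table helper, the plain dicts standing in for Recipe.rates, and the sample numbers are all made up for the example, and it writes to an in-memory buffer rather than an lzma file.

```python
import io


def write_table(rows, columns, out):
    """Write row labels going down and column labels going right.

    `rows` maps a label to a {column: value} dict; this is a hypothetical
    stand-in for the Recipe objects and their `rates` in the script above.
    """
    name_width = max(len(name) for name in rows)
    float_width = 15
    # Doubled braces survive the f-string, leaving '{:23}' for str.format
    head_format = f'{{:{float_width + 8}}}'
    label_format = '\n{:' + str(name_width + 1) + '}'

    # Blank corner cell, then one padded header per column
    out.write(' ' * (name_width + 1))
    for col in columns:
        out.write(head_format.format(f'{col},'))

    for name, rates in rows.items():
        out.write(label_format.format(f'{name},'))
        for col in columns:
            # Signed scientific notation; missing entries become zero
            cell_format = f'{{:+{len(col)}.{float_width}e}},'
            out.write(cell_format.format(rates.get(col, 0)))


# Made-up sample data, not taken from the wiki
sample_rows = {
    'Iron plate': {'Iron ore': -1.0, 'Iron plate': 1.0},
    'Gear': {'Iron plate': -2.0},
}
sample_columns = ['Iron ore', 'Iron plate']

buf = io.StringIO()
write_table(sample_rows, sample_columns, buf)
text = buf.getvalue()
print(text)
```

With 15 digits after the decimal point, each signed value plus its trailing comma is 23 characters wide, which is why the header cells are padded to float_width + 8.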



      The main thing that needs work is the recipe factory code. It sprinkles logic about item types where it doesn't belong, and that really needs to be improved. I have some ideas about how to do that, but I'd like to hear from the community (on that, and any other wrinkles you find).
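
To make that concern concrete, here is one possible shape for keeping the item-type rules in a single table instead of an if/elif chain like the one in make(). This is only a sketch under stated assumptions, not the code in this post: the rule decorator, the handler names, and the plain dicts standing in for Item objects are all hypothetical, and the handlers return placeholder strings rather than real Recipe instances.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical registry of (predicate, handler) pairs. The first matching
# predicate wins, mirroring the order of an if/elif chain.
Rule = Tuple[Callable[[dict], bool], Callable[[dict], Iterable[str]]]
_rules: List[Rule] = []


def rule(predicate: Callable[[dict], bool]):
    """Register a handler for items that satisfy `predicate`."""
    def register(handler):
        _rules.append((predicate, handler))
        return handler
    return register


@rule(lambda item: item.get('prototype_type') == 'technology')
def tech_recipes(item):
    yield f"{item['title']} (technology)"


@rule(lambda item: item.get('title') == 'Raw wood')
def wood_recipes(item):
    yield f"{item['title']} (manual mining)"


@rule(lambda item: 'mining_time' in item)
def mining_recipes(item):
    yield f"{item['title']} (mining drill)"


def make_recipes(item: dict) -> List[str]:
    """Dispatch to the first handler whose predicate accepts the item."""
    for predicate, handler in _rules:
        if predicate(item):
            return list(handler(item))
    raise NotImplementedError(item.get('title'))


print(make_recipes({'title': 'Raw wood', 'mining_time': '0.5'}))
print(make_recipes({'title': 'Iron ore', 'mining_time': '2'}))
```

Because registration order decides which rule wins, special cases such as raw wood have to be registered before the general mining rule, much as they must come first in the existing chain.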









      $endgroup$











      python numpy scipy















      asked 3 mins ago









      Reinderien




















